Abstract: In this paper, we investigate a substantially extended
version of the classical MAB problem, called the networked
combinatorial bandit problem. In particular, we consider a decision
maker acting over networked bandits as follows: each time a
combinatorial strategy, e.g., a group of arms, is chosen, the
decision maker receives a reward resulting from her strategy and
also receives a side bonus resulting from that strategy for each
arm's neighbors. This is motivated by many real applications such
as online social networks, where friends can provide feedback on
shared content; if we promote a product to a user, we can also
collect feedback from her friends on that product. To this end, we
consider two types of side bonus in this study: side observation
and side reward. Depending on the number of arms pulled at each
time slot, we study two cases: single-play and combinatorial-play.
Consequently, this leaves us four scenarios to investigate in the
presence of side bonus: Single-play with Side Observation,
Combinatorial-play with Side Observation, Single-play with Side
Reward, and Combinatorial-play with Side Reward. For each case, we
present and analyze a series of zero regret policies, for which the
expected average regret over time approaches zero as time goes to
infinity. Extensive simulations validate the effectiveness of our
results.

I. Introduction 

A multi-armed bandit (MAB) problem is a basic sequential decision
making problem defined by a set of strategies. At each decision
epoch, a decision maker selects a strategy that involves a
combination of random bandits or variables, and then obtains an
observable reward. The decision maker learns to maximize the total
reward obtained in a sequence of decisions through the history of
observations. MAB problems naturally capture the fundamental
tradeoff between exploration and exploitation in sequential
experiments. That is, the decision maker must exploit strategies
that did well in the past on the one hand, and explore strategies
that might have higher gain on the other hand. MAB problems now
play an important role in online computation under unknown
environments, such as pricing and bidding in electronic commerce
[?], [?], ad placement on web pages [?], source routing in dynamic
networks [?], and opportunistic channel access in cognitive radio
networks [?], [?]. In this paper, we investigate a substantially
extended version of the classical MAB problem, called the networked
combinatorial bandit problem. In particular, we consider a decision
maker acting over networked bandits as follows: each time a
combinatorial strategy, e.g., a group of arms, is chosen, the
decision maker receives a direct reward resulting from her strategy
and also receives a side bonus (either observation or reward)
resulting from that strategy for each arm's neighbors.

In this study, we take as input a relation graph $G$ that
represents the correlation among $K$ arms. In the standard setting,
pulling an arm $i$ yields the reward and observation $X_{i,t}$,
while in the networked combinatorial bandit problem with side
bonus, one also gets a side observation or even a reward due to the
similarity or potential influence among neighboring arms. We
consider two types of side bonus in this work:
(1) Side-observation: by pulling arm $i$ at time $t$ one gains the
direct reward associated with $i$ and also observes the rewards of
its neighboring arms. Such side-observation [?] is made possible in
settings such as online social networks, where friends can provide
feedback on shared content; if we promote a product to a user, we
can also collect feedback from her friends on that product;
(2) Side-reward: in many practical applications such as
recommendation in social networks, pulling an arm $i$ not only
yields side observation on neighbors, but also yields extra
rewards. That is, by pulling arm $i$ one directly gains the reward
associated with $i$ together with the rewards of its neighboring
arms. This setting is motivated by the observation that users are
usually influenced by their friends when making purchasing
decisions [?].

Despite many existing results on MAB problems against unknown
stochastic environments [?], [?], [?], [?], [?], the adopted
formulations do not fit applications that involve either side bonus
or an exponentially large number of candidate strategies. There are
several challenges facing our new study. First, under the
combinatorial setting, the number of candidate strategies could be
exponentially large; if one simply treats each strategy as an arm,
the resulting regret bound is exponential in the number of
variables or arms. Second, traditional MAB assumes that all the
arms are independent, which is inappropriate in our setting. Third,
in the presence of side bonus, how to appropriately leverage
additional information in order to gain higher rewards is another
challenge. To this end, we explore a more general formulation for
networked combinatorial bandit problems under four scenarios,
namely, single/combinatorial play with side observation and
single/combinatorial play with side reward. The objective is to
minimize the upper bound of regret (or maximize the total reward)
over time.

The contributions of this paper are listed as follows: 

• For the Single-play with Side Observation case, we present
the first distribution-free learning (DFL) policy, whose time
and space complexity are bounded by $O(K)$. Our policy achieves
zero regret that does not depend on $\Delta_{\min}$, the minimum
distance between the best static strategy and any other strategy.

• For the Combinatorial-play with Side Observation case, we
present a learning policy with zero regret. Compared with the
traditional MAB problem without side bonus, we reduce the
regret bound significantly.

• For the Single-play with Side Rewards case, we develop a
distribution-free zero regret learning policy. We theoretically
show that this scheme converges faster than any existing method.

• For the Combinatorial-play with Side Rewards case, by
assuming that the combinatorial problem at each decision point
can be solved optimally, we present the first distribution-free
zero regret policy.

We evaluate our proposed learning policy through extensive 
simulations and simulation results validate the effectiveness 
of our schemes. 

The remainder of this paper is organized as follows. We first
give a formal description of the networked combinatorial
multi-armed bandit problem in Section II. We study the Single-play
with Side Observation case in Section III. In Section IV we study
the Combinatorial-play with Side Observation case. The Single-play
with Side Rewards case is discussed in Section V. In Section VI we
study the Combinatorial-play with Side Rewards case. We evaluate
our policies via extensive simulations in Section VII. We review
related works in Section VIII. We conclude this paper, and discuss
limitations as well as future work, in Section IX. Most notations
used in this paper are summarized in Table I.

II. Models and Problem Formulation 

In the standard MAB problem, a $K$-armed bandit problem is
defined by $K$ distributions $\mathcal{V}_1, \dots, \mathcal{V}_K$,
each arm with respective means $\mu_1, \dots, \mu_K$. When the
decision maker pulls arm $i$ at time $t$, she receives a reward
$X_{i,t}$. We assume all rewards
$\{X_{i,t} : 1 \le i \le K,\ t \ge 1\}$ are independent, and all
$\{\mathcal{V}_i\}$ have support in $[0,1]$. Let $i = 1$ denote the
optimal arm, and $\Delta_i = \mu_1 - \mu_i$ be the difference
between the best arm and arm $i$.

The relation graph $G = (V, E)$ over the $K$ arms describes the
correlations among them, where an undirected link $e(i,j) \in E$
indicates the correlation between two neighboring arms $i$ and $j$.
In the standard setting, pulling an arm $i$ yields the reward and
observation $X_{i,t}$, while in the networked combinatorial bandit
problem with side bonus, one also gets a side observation or even a
reward from neighboring arms due to the similarity or potential
influence among them. Let $N(i)$ denote the set of neighboring arms
of arm $i$ and $N_i = \{i\} \cup N(i)$. In this work, we consider
two types of side bonus:

• Side observation: by pulling arm $i$ at time $t$ one gains
the reward $X_{i,t}$ associated with $i$ and also observes the
reward $X_{j,t}$ of each of $i$'s neighboring arms $j \in N_i$.
This is motivated by many real applications; for example, in
today's online social networks, friends can provide feedback on
shared content, so if we promote a product to one user, we can
also collect feedback from her friends on that product;

• Side reward: pulling an arm $i$ not only yields side
observation on neighbors, but also yields rewards from them,
i.e., the total reward is $\sum_{j \in N_i} X_{j,t}$. This
setting is motivated by the observation that in many practical
applications such as recommendation in social networks, users
are usually influenced by their friends when making purchasing
decisions.
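The two types of side bonus can be contrasted on a toy instance. The following is a minimal sketch, assuming a hypothetical 4-arm path graph and made-up reward realizations; the names `neighbors`, `closed_neighborhood`, and `pull` are illustrative and not part of the paper.

```python
# Hypothetical relation graph: a path 1-2-3-4 (illustrative only).
neighbors = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}


def closed_neighborhood(i):
    """Return N_i = {i} united with N(i)."""
    return {i} | neighbors[i]


def pull(i, rewards, mode):
    """Return (collected reward, observed rewards) when arm i is pulled.

    rewards maps every arm j to its realized reward X_{j,t}.
    """
    observed = {j: rewards[j] for j in closed_neighborhood(i)}
    if mode == "side_observation":
        collected = rewards[i]              # only the direct reward counts
    elif mode == "side_reward":
        collected = sum(observed.values())  # rewards of all arms in N_i count
    else:
        raise ValueError(f"unknown mode: {mode}")
    return collected, observed


# Made-up realizations of X_{j,t} for one time slot.
rewards_t = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.1}
so_reward, so_seen = pull(2, rewards_t, "side_observation")
sr_reward, sr_seen = pull(2, rewards_t, "side_reward")
```

In both modes the same set of arms is observed; only the collected reward differs.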

Depending on the number of arms pulled at each time slot, we
study the single-play case and the combinatorial-play case.

• In the single-play case, the decision maker selects one
arm at each time slot; e.g., the traditional MAB problem
belongs to this category;

• In the combinatorial-play case, the decision maker is
required to select a combination of $M$ ($M < K$) arms that
satisfies given constraints. One such example is online
advertising: assume an advertiser can only place up to $m$
advertisements on his website; he repeatedly selects a set of
$m$ advertisements and observes the click-through-rate, with
the goal of maximizing the average click-through-rate. This
problem can be formulated as a combinatorial MAB problem where
each arm represents one advertisement, subject to the
constraint that one can play at most $m$ arms at each time
slot. In the combinatorial case, at each time slot $t$, an
$M$-dimensional strategy vector $s_x$ is selected under some
policy from the feasible strategy set $F$. By feasible we mean
that each strategy satisfies the underlying constraints imposed
on $F$. We use $x = 1, \dots, |F|$ to index the strategies of
the feasible set $F$ in decreasing order of average reward
$\lambda_x$, e.g., $s_1$ has the largest average reward. Note
that a strategy may consist of fewer than $M$ random variables,
as long as it satisfies the given constraints. We then set
$i = 0$ for any empty entry $i$.

In either case, the objective is to minimize long-term regret 
after n time slots, defined by cumulative difference between 
the received reward and the optimal reward. 

Consequently, this leaves us four scenarios to investigate:
Single-play with Side Observation, Combinatorial-play with Side
Observation, Single-play with Side Reward, and Combinatorial-play
with Side Reward. We then describe the problem formulation for each
case. We use $I_t$ to denote the index of the arm (resp. strategy)
selected by the decision maker at time slot $t$, and subscript 1 to
denote the optimal arm (resp. strategy) in the four cases. We
evaluate policies using regret, which is defined as the difference
in the total expected reward (over $n$ rounds) between always
playing the optimal strategy and playing arms according to the
policy. We say a policy achieves zero regret if the expected
average regret over time approaches zero as time goes to infinity,
i.e., $\mathfrak{R}_n / n \to 0$ as $n \to \infty$.
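As a worked illustration of the zero-regret condition, consider a policy whose regret satisfies a bound of the form $\mathfrak{R}_n \le c\sqrt{nK}$ for some constant $c$, which is the shape of the distribution-free bounds discussed later in the paper. The constant $c$ and the values $K = 100$, $n = 10^6$ below are illustrative assumptions only.

```latex
\frac{\mathfrak{R}_n}{n} \;\le\; \frac{c\sqrt{nK}}{n} \;=\; c\sqrt{\frac{K}{n}}
\;\xrightarrow[n \to \infty]{}\; 0,
\qquad \text{e.g. } K = 100,\ n = 10^{6}:\ \sqrt{K/n} = 10^{-2}.
```

Any bound that grows sublinearly in $n$ therefore implies zero regret in the sense defined above.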

TABLE I
Summary of notations

Variable              Meaning
$K$                   number of arms
$M$                   number of selected arms
$G$                   relation graph over the arms
$X_{i,t}$             observation/direct reward on arm $i$ at time $t$
$\mu_i$               mean of $X_{i,t}$
$N_i$                 set of neighboring arms of arm $i$
$\Delta_i$            the distance between the best strategy and strategy $i$
$B_{i,t}$             side reward received by arm $i$ from $N_i$
$O_{i,t}$             number of observation times on arm $i$ by time $t$
$O^b_{i,t}$           number of update times on side rewards of arm $i$ by time $t$
$\bar{X}_{i,t}$       time averaged value of observation on arm $i$ by time $t$
$H$                   vertex-induced subgraph of $G$ composed of arms with $\Delta_i > \delta_0$
$C$                   clique cover of $H$
$F$                   feasible strategy (arm or com-arm) set
$R_{x,t}$             direct reward on com-arm $x$ at time $t$
$\lambda_x$           mean of $R_{x,t}$
$Y_x$                 set of neighboring arms of component arms in com-arm $x$
$N$                   maximum of $|Y_x|$ among all com-arms
$CB_{x,t}$            combinatorial side reward received by com-arm $x$ from $Y_x$
$\Delta_x$            the distance between the best strategy and strategy $x$
$\Delta_{\min}$       minimum of $\Delta_x$ among all strategies

1) Single-play with Side Observation (SSO). In this case, the
decision maker pulls an arm $i$, observes all $X_{j,t}$,
$j \in N_i$, and gets a reward $X_{i,t}$. The regret by time
slot $n$ is written as

$$\mathfrak{R}_n = \mathbb{E}\Big[\sum_{t=1}^{n} X_{1,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^{n} X_{I_t,t}\Big]. \quad (1)$$

Here $I_t$ denotes the index of the arm played at time $t$.

2) Combinatorial-play with Side Observation (CSO). Rather
than pulling a single arm, the decision maker pulls a set
of arms, $s_{I_t}$, receives a reward

$$R_{I_t,t} = \sum_{i \in s_{I_t}} X_{i,t},$$

and also observes the reward $X_{j,t}$ for each neighboring arm
$j \in Y_{I_t}$, where $Y_{I_t} = \cup_{i \in s_{I_t}} N_i$ is
the set of neighboring arms for the selected strategy $I_t$.
Therefore, letting $\lambda_1$ denote the expected reward of the
optimal strategy, the regret is defined as

$$\mathfrak{R}_n = \sum_{t=1}^{n} \lambda_1 - \mathbb{E}\Big[\sum_{t=1}^{n} R_{I_t,t}\Big]. \quad (2)$$

3) Single-play with Side Rewards (SSR). Pulling an arm $i$
yields a total reward

$$B_{i,t} = \sum_{j \in N_i} X_{j,t}.$$

Therefore, the best arm is the one with the maximum expected
total reward. Let $u_i = \sum_{j \in N_i} \mu_j$ denote the
mean reward for arm $i$, and $u_1$ the maximum mean reward.
The regret is

$$\mathfrak{R}_n = \sum_{t=1}^{n} u_1 - \mathbb{E}\Big[\sum_{t=1}^{n} B_{I_t,t}\Big]. \quad (3)$$

Note that the optimal arm here may differ from the optimal
arm under single-play with side observation.
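The remark that the optimal arm can differ between SSO and SSR can be checked on a small example. This is a sketch with made-up means and a hypothetical graph in which arm 1 is isolated and arm 2 is adjacent to arms 3 and 4; the names `mu`, `closed_nbrs`, and `side_reward_mean` are illustrative.

```python
# Made-up means mu_i; arm 1 has the largest direct mean.
mu = {1: 0.9, 2: 0.5, 3: 0.5, 4: 0.5}

# Hypothetical closed neighborhoods N_i = {i} plus the neighbors of i.
closed_nbrs = {1: {1}, 2: {2, 3, 4}, 3: {2, 3}, 4: {2, 4}}


def side_reward_mean(i):
    """u_i: the sum of mu_j over j in N_i."""
    return sum(mu[j] for j in closed_nbrs[i])


best_sso = max(mu, key=mu.get)            # best arm by direct mean mu_i
best_ssr = max(mu, key=side_reward_mean)  # best arm by side-reward mean u_i
```

Here arm 1 maximizes the direct mean (0.9), while arm 2 maximizes $u_i$ (three arms with mean 0.5 each), so the two settings disagree on the optimal arm.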


4) Combinatorial-play with Side Rewards (CSR). Different from
combinatorial-play with side observation, the decision
maker directly obtains the rewards from all neighboring
arms. That is, the total received reward includes the direct
reward of strategy $x$ and the side reward from its neighbors.
Let $Y_x = \cup_{i \in s_x} N_i$ be the set of neighboring arms
for strategy $x$, and $\sigma_x = \sum_{i \in Y_x} \mu_i$ be the
expected reward of $s_x$. The combinatorial reward at time slot
$t$ is written as $CB_{x,t} = \sum_{i \in Y_x} X_{i,t}$. We
define the regret as

$$\mathfrak{R}_n = \sum_{t=1}^{n} \sigma_1 - \mathbb{E}\Big[\sum_{t=1}^{n} CB_{I_t,t}\Big]. \quad (4)$$

III. Single-play with side observation 

We start with the case of Single-play with Side Observation.
In this case, the decision maker learns to select an arm (resp.
strategy) with maximum reward, while also observing side
information of its neighbors as defined in the relation graph. Our
proposed policy, which is the first distribution-free learning
policy for SSO and is referred to as DFL-SSO, is shown in
Algorithm 1. As shown in Lines 2-5, the decision maker updates the
side information of all neighbors, i.e., the number of observations
up to the current time and the time-averaged reward. The key idea
behind the algorithm is that side observation potentially reduces
the regret, as the decision maker can explore more at no extra
cost and thus gain more history information to exploit.
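The select-then-update loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation, under several assumptions of ours: rewards are Bernoulli, the logarithm in the index is truncated at zero, unobserved arms get an infinite index, and the running mean uses the already-incremented count. The function name `dfl_sso` and its parameters are hypothetical.

```python
import math
import random


def dfl_sso(means, neighbors, n, seed=0):
    """Run a DFL-SSO-style policy for n slots on Bernoulli arms.

    means[i] is the (hidden) mean of arm i; neighbors[i] lists N(i).
    Returns (pull counts, observation counts O_i).
    """
    rng = random.Random(seed)
    K = len(means)
    obs = [0] * K      # O_{i,t}: number of observations of arm i
    avg = [0.0] * K    # time-averaged observed reward of arm i
    pulls = [0] * K

    def index(i, t):
        if obs[i] == 0:
            return math.inf  # assumption: force a first observation
        bonus = max(math.log(t / (K * obs[i])), 0.0) / obs[i]
        return avg[i] + math.sqrt(bonus)

    for t in range(1, n + 1):
        i = max(range(K), key=lambda a: index(a, t))
        pulls[i] += 1
        # Side observation: the pulled arm and all its neighbors are observed.
        for k in {i} | set(neighbors[i]):
            x = 1.0 if rng.random() < means[k] else 0.0
            obs[k] += 1
            avg[k] += (x - avg[k]) / obs[k]  # running mean update
    return pulls, obs


# On a complete relation graph every pull observes every arm.
pulls, obs = dfl_sso([0.9, 0.1, 0.1], {0: [1, 2], 1: [0, 2], 2: [0, 1]}, 200)
```

With no edges at all, observation counts coincide with pull counts, which recovers the usual setting without side observation.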

To theoretically analyze the benefit of side observation, we
combine the techniques of graph partition and clique cover in a
novel way. The basic idea in the standard analysis of the regret
bound with side observation in the distribution-dependent case is
to use a clique cover of the relation graph, and to use the arm
with maximum $\Delta_i$ inside each clique to represent the clique
for analysis. The standard proof of a distribution-free regret
bound, in contrast, divides the arms into two sets via a threshold
$\Delta_{c_0}$ on $\Delta_i$, and then analyzes the bounds of the
two sets of arms separately. Therefore, to obtain a
distribution-free result, we cannot directly use the arm with
maximum $\Delta_i$ inside a clique as its representative, because
the arms with $\Delta_i$ smaller than $\Delta_{c_0}$ are
distributed inside cliques. To address this issue, we first
partition the relation graph $G$ using the predefined threshold,
and then mainly analyze the benefit of side observation in one
vertex-induced subgraph $H$ for arms having $\Delta_i$ above
$\Delta_{c_0}$. In the subgraph $H$, it is then possible to analyze
the distribution-free regret bound using the technique of clique
cover.
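The partition-then-cover idea can be sketched in a few lines. This is an illustration under assumptions: the graph and gaps are made up, and the greedy routine below only produces some clique cover, not necessarily a minimum one (finding a minimum clique cover is NP-hard in general). The names `induced_subgraph` and `greedy_clique_cover` are hypothetical.

```python
def induced_subgraph(adj, gaps, delta0):
    """Keep only arms whose gap Delta_i exceeds the threshold delta0."""
    keep = {i for i, g in gaps.items() if g > delta0}
    return {i: adj[i] & keep for i in keep}


def greedy_clique_cover(adj):
    """Greedily cover the vertices of adj with cliques (not minimum in general)."""
    uncovered = set(adj)
    cover = []
    while uncovered:
        v = min(uncovered)
        clique = {v}
        for u in sorted(uncovered - {v}):
            # u joins only if it is adjacent to every vertex already in the clique
            if all(u in adj[w] for w in clique):
                clique.add(u)
        cover.append(clique)
        uncovered -= clique
    return cover


# Made-up relation graph G and gaps Delta_i (arm 1 is optimal, gap 0).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
gaps = {1: 0.0, 2: 0.3, 3: 0.4, 4: 0.5, 5: 0.05}

H = induced_subgraph(adj, gaps, delta0=0.1)  # arms 2, 3, 4 survive
cover = greedy_clique_cover(H)               # [{2, 3}, {4}]
```

Arms below the threshold (here arms 1 and 5) are removed together with their edges before the cover is computed, mirroring the partition step of the analysis.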

Theorem 1 quantifies the benefit brought by side observation: the
more side observation there is (e.g., the smaller the clique
number), the smaller the upper bound on regret.

Theorem 1: The expected regret of Algorithm 1 after $n$ time
slots is bounded by

$$\mathfrak{R}_n \le 15.94\sqrt{nK} + 0.74\,C\sqrt{n/K}, \quad (6)$$

where $C$ is the clique cover of the vertex-induced subgraph $H$
formed by the arms with $\Delta_i$ above the threshold $\delta_0$
in relation graph $G$.

Fig. 1. Graph partition: G is the relation graph, and H is the
vertex-induced subgraph that is covered by 3 cliques

Algorithm 1 Distribution-Free Learning policy for single-play
with side observation (DFL-SSO)

1: for each time slot $t = 0, 1, \dots, n$ do
     Select an arm $i$ by maximizing

     $$\bar{X}_{i,t} + \sqrt{\frac{\log\big(t/(K O_{i,t})\big)}{O_{i,t}}} \quad (5)$$

     to pull
2:   for $k \in N_i$ do
3:     $O_{k,t+1} \leftarrow O_{k,t} + 1$
4:     $\bar{X}_{k,t+1} \leftarrow X_{k,t}/O_{k,t} + (1 - 1/O_{k,t})\bar{X}_{k,t}$
5:   end for
6: end for

Proof: The proof is based on our novel combination of
graph partition and clique cover. We first partition the relation



graph to rewrite the regret in terms of cliques, and then tighten
the upper bound by analyzing the regret of cliques.

1. Partition the relation graph and rewrite the regret of
subgraph H in terms of cliques.

We order the arms in increasing order of $\Delta_i$. We use
$\Delta_{c_0} \le \delta_0 = \alpha\sqrt{K/n} \le \Delta_{c_0+1}$
to split the $K$ arms into two disjoint sets, one set $K_1$ with
$\Delta_x \le \Delta_{c_0}$ and the other set $K_2$ with
$\Delta_x > \Delta_{c_0}$ (we will set the value of $\alpha$ in the
later analysis). Let $c_0$ be the smallest index of arm satisfying
$\Delta_i \le \Delta_{c_0}$. We remove all arms in $K_1$ from the
relation graph $G$, as well as the edges adjacent to nodes in
$K_1$. In this way, we get a subgraph $H$ of $G$ over the arms in
$K_2$. The regret satisfies

$$\mathfrak{R}(n) \le n\Delta_{c_0} + \mathfrak{R}_H(n), \quad (7)$$

where $\mathfrak{R}_H(n)$ is the regret generated by selecting
suboptimal arms in $K_2$.

Consider a clique covering $C$ of $H$, i.e., a set of cliques
such that each $c \in C$ is a clique and $V = \cup_{c \in C} c$. We
define the clique regret $\mathfrak{R}_c(n)$ for any $c \in C$ by

$$\mathfrak{R}_c(n) = \sum_{t \le n} \sum_{i \in c} \Delta_i \mathbb{E}\big[\mathbb{1}\{I_t = i\}\big]. \quad (8)$$

Since the set of cliques covers the whole graph $H$, we have

$$\mathfrak{R}_H(n) \le \sum_{c \in C} \mathfrak{R}_c(n). \quad (9)$$

We give an illustration of the partition process in Fig. 1,
where the relation graph $G$ contains one small set of blue nodes
representing $K_1$ with $\Delta_i$ below $\Delta_{c_0}$, and the
other large set of white nodes denoting $K_2$ with $\Delta_i$ above
$\Delta_{c_0}$. The vertex-induced subgraph $H$ of $K_2$ is covered
by a minimum of 3 cliques, respectively marked by black, gray and
dashed lines.

2. Regret analysis for subgraph H.

In the rest of the proof, we focus on proving an upper bound on
the regret $\mathfrak{R}_H(n)$. Let $\Delta_c = \max_{i \in c} \Delta_i$,
and let $T_c(t) = \sum_{i \in c} T_i(t)$ denote the number of times
(any arm in) clique $c$ has been played up to time $t$, where
$T_i(t)$ is the number of times arm $i$ has been selected up to
time $t$. Similarly, we suppose that cliques are ordered in
increasing order of $\Delta_c$. Let $v_j = \mu_1 - \Delta_j/2$ for
cliques in $K_2$, $c_0 < j \le K$, and $v_{c_0} = \mu_1 - \delta_0/2$.
Let $z_{c_0} = +\infty$ and $z_{K+1} = +\infty$. For ease of
description, we use $c_0$ to denote the case of $c = 0$.

As every arm in a clique $c$ must be observed the same number of
times, for each clique and $l_0 > 0$ we have

$$\mathfrak{R}_c = \sum_{i \in c} \Delta_i T_i(n) \le l_0 \max_{i \in c} \Delta_i + \sum_{i \in c} \Delta_i \sum_{t > l_0} \mathbb{1}\{I_t = i\}. \quad (10)$$

Meanwhile,

$$\mathfrak{R}_H(n) \le \sum_{c \in C} \mathfrak{R}_c \le \sum_{c \in C} l_0 \Delta_c + \sum_{i=1}^{K} \Delta_i T'_i(n), \quad (11)$$

where $T'_i(n)$ denotes the number of times arm $i$ is played
after $t = l_0$, and we refer to the second term as
$\mathfrak{R}'_H(n)$.

Define

$$W = \min_{1 \le t \le n} W_{1,t}, \quad (12)$$

and

$$W_{j,i} = \mathbb{1}\{W \in [v_{j+1}, v_j)\}\, \Delta_i T'_i(n). \quad (13)$$

We have the following for $\mathfrak{R}'_H(n)$:

$$\mathfrak{R}'_H(n) = \sum_{i=1}^{K} \Delta_i T'_i(n) \quad (14)$$

$$= \sum_{j=c_0}^{K} \sum_{i=1}^{j} W_{j,i} + \sum_{j=c_0}^{K} \sum_{i=j+1}^{K} W_{j,i}. \quad (15)$$

For the first term of Equation (15), we have:

$$\sum_{j=c_0}^{K} \sum_{i=1}^{j} W_{j,i} \le \sum_{j=c_0}^{K} \mathbb{1}\{W \in [v_{j+1}, v_j)\}\, n\Delta_j \quad (16)$$

$$= n\Delta_{c_0} + n \sum_{c=1}^{C} \mathbb{1}\{W < v_c\}(\Delta_c - \Delta_{c-1}). \quad (17)$$

The first inequality holds because $\Delta_j \ge \Delta_i$ and
$\sum_i T'_i(n) \le n$.

To bound the second term of Equation (15), we record

$$\tau_i = \min\{t : W_{i,t} < v_i\} \quad (18)$$

after $l_0$. To pull a suboptimal arm $i$ at $t$, one must have
$W_{i,t} \ge W_{1,t} \ge W$. By Algorithm 1, we have
$\{W \ge v_i\} \subset \{T'_i(n) < \tau_i\}$, since once we have
pulled arm $i$ $\tau_i$ times its index will always be lower than
the index of arm 1.










Therefore, we have 


We now bound $P(W < v_c)(\Delta_c - \Delta_{c-1})$.


K 


9f(n) < 2nAco + ^/qAc + ^ AjE(ri|t >/o) 


Recall that Acg < i5o < Acq+i, and let ^co be Ac=o- Taking 

Ac 
2 


P{W < fii — as an nonincreasing function of Ac, we 


cGC 


i=l 


have 


Trt ^ lw<tJc(^c Ac—i). 


C=1 


For any Iq > 0, 


AiE(Ti|ri > Iq) 
+ 00 

I—Iq 
+ 00 

I —In 


(19) 


( 20 ) 


^P(l^<t;c)(Ac-Ac_i) 

C^l 

<6o-A,,+ [ 1P{W < Pi - ^)du. (33) 

JSn ^ 


< gpfe,-,.>^-v/bs±<^) 

i = io ^ 


For a fixed $u \in [\delta_0, 1]$ and $f(u) = 8\log(\sqrt{n/K}\,u)/u^2$, we
have


P(VF<Pi- 2 ) 


— login/iKl)) u 

= P ( 31 < Z < n : Ai_/ + \/-^- < pi — — 


( 21 ) 


< P 


Let $l_0 = 8\log\big(\frac{n}{K}\Delta_i^2\big)/\Delta_i^2$. For $l \ge l_0$, we have

log+(f/(iT/)) < log+(7r/(iT/o)) <i^x^) (22) 

< (23) 

- 8 - 8 

Therefore, we have


(31 <i </(.):„-Tu>\/AL^) 


+P ^31 < I < f{u) : Pi - Ai,i > ^ 


(34) 


A, ^ /log+(n//T0 ^ Ai A, 


„ > —- i==aAi (24) 

2 V / - 2 78 


Let $P_1$ denote the first term of (34). Using the form
$\frac{1}{2^{m+1}} f(u) \le l \le \frac{1}{2^m} f(u)$, we have






AJq < 81og(-A2)/A, <-vW^ (25) 

K e 

To bound (1211 1 using Hoeffding Bound, i.e., 

+ 00 

E{rj|t > Iq} < ^ - Pj > aAi) (26) 

l=lo 

+00 

< exp {—2l{aAi)'^) (27) 

I—Iq 

= ^ 1 - 2/o(aAi)^ 

1 _ exp(-2(aA,)2) ^ ^ 

I—to 


/(u)2-(™+i)log(^) 


< y^exp[-2 


= 2 


m—1 

Kf{u) 


fiu)2- 


(35) 


Let $P_2$ denote the second term of (34). Using the form
$2^m f(u) \le l \le 2^{m+1} f(u)$, we similarly have


P 2 < P ( 32”"/(u) < I < 2™+V(u) : 


< 


< 


1 


1 — exp(—2(aAi)^) 

1 

(2aA,)2 - (-2(aA02) 

1 


2aAj (1 — a^)' 


(29) 


(30) 


(31) 




< ^ exp ( -2 


m—0 


,{ 2^-^f{u)uy 

/(u)2"*+i 


Then we have 

AiE{ri|f > Zo} < 81 og(-^Ai)/Ai + 


?T- . 2i ' * 1 


< 

< 


1 


exp(/(M)M2/4) _ 1 
1 


nv? jK — 1 


(36) 


'K 


2aAi(l — a^) 


=5 


-1 


The last inequality holds because $f(u)$ is upper bounded by
$4n/(eK)$.


































By integrating $P_1$ and $P_2$, we respectively have

(37) 


/ Pidu < n - / f{u)d 

J 5q ^ J 5q 


2K 


= n- 


8 \og{e^/njKu) 


So 
J 1 




and 


f 1 

n / P 2 du < - log 
Jso 2 


a + 1 
a — 1 


V nK. 


(38) 


(39) 


Instantly we have 

c 

n'^P{W < 'Uc)(Ac - Ac-i) 


c=0 

< n{So — Aco) + 


81og(eQ;) 1 , f a + 1 

+ IT log' 
a 2 


) VnK 


Finally, we get the regret bounded by 
Tin < ^ ^ \/n/K + ^3a 


cGC 


a — 1J J 


8 loglea) 1 , f a + I 

-olog -7 

a 2 V a — 1 


+ 


-1 


2a(l — a^) ^ 
(40)

Let $\alpha = e$, and we already have $a = \frac{1}{2} - \frac{1}{\sqrt{8}}$; then

$$\mathfrak{R}_n \le 15.94\sqrt{nK} + 0.74\,C\sqrt{n/K}. \quad (41)$$


IV. Combinatorial-play with side observation 

In this section, we consider combinatorial-play with side
observation. In this case, an intuitive extension is to take
each strategy as an arm (we name it a com-arm), and then
apply the algorithm for SSO to solve the problem. However,
the key question is how to utilize the side observation on arms
defined in the relation graph to gain more observations on
com-arms, that is, how to define neighboring com-arms. To this
end, we introduce the concept of a strategy relation graph to
model the correlation among com-arms, by which we convert
the problem of CSO to SSO.

The construction process for the strategy relation graph is
as follows. We define the strategy relation graph $SG(F, L)$ for
strategies in $F$, where $F$ is the vertex set and $L$ is the edge
set. Each strategy is denoted by a vertex, and a link
$l = (s_x, s_y)$ in $L$ connects two distinct vertices $s_x$ and
$s_y$ if $s_y \subseteq Y_x$ and vice versa. This neighbor
definition for strategies is natural: once a strategy is played,
the union of the neighbors of the arms in this strategy is observed
according to the neighbor definition for arms in $G$, so the reward
of any strategy composed of these observed arms is also observed.
We give an example in Fig. 2. There are 4 arms in relation graph
$G$, indexed by $i = 1, 2, 3, 4$, with neighborhoods
$N_1 = \{1,2\}$, $N_2 = \{1,2,3\}$, $N_3 = \{2,3,4\}$ and
$N_4 = \{3,4\}$. The combinatorial MAB problem is to select a
maximum weighted independent set of arms, where the unknown bandit
is the weight. As shown in Fig. 2, the feasible strategy set for
this problem consists of 7 feasible strategies, i.e., independent
sets of arms in $G$:

$s_1 = \{1\}$, $\cup_{i \in s_1} N_i = \{1,2\}$
$s_2 = \{2\}$, $\cup_{i \in s_2} N_i = \{1,2,3\}$
$s_3 = \{3\}$, $\cup_{i \in s_3} N_i = \{2,3,4\}$
$s_4 = \{4\}$, $\cup_{i \in s_4} N_i = \{3,4\}$
$s_5 = \{1,3\}$, $\cup_{i \in s_5} N_i = \{1,2,3,4\}$
$s_6 = \{1,4\}$, $\cup_{i \in s_6} N_i = \{1,2,3,4\}$
$s_7 = \{2,4\}$, $\cup_{i \in s_7} N_i = \{1,2,3,4\}$

Fig. 2. Convert combinatorial-play to single-play: constructing strategy
relation graph $SG(F, L)$ based on arm relation graph $G$

Taking $s_2$ and $s_5$ for illustration, the set of component arms
of $s_2$, i.e., $\{2\}$, is a subset of
$\cup_{i \in s_5} N_i = \{1,2,3,4\}$, and the set of component arms
of $s_5$, i.e., $\{1,3\}$, is also a subset of
$\cup_{i \in s_2} N_i = \{1,2,3\}$. Therefore, the two strategies
are connected in the relation graph $SG$.

Consequently, we can convert the combinatorial-play MAB
with side observation to a single-play MAB with side observation.
More specifically, taking each strategy as an arm, $SG(F, L)$
is exactly a relation graph for the com-arms in $F$. The problem
turns into a single-play MAB problem where at each time
slot the decision maker selects one com-arm from the $|F|$ ones to
maximize her long-term reward.
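The construction of the strategy relation graph can be sketched on the 4-arm example above. This is an illustrative sketch, not the authors' code; the arm neighborhoods and the seven strategies are taken from the example, while the names `arm_nbrs`, `observed_arms`, and `strategy_relation_graph` are ours.

```python
from itertools import combinations

# Closed neighborhoods N_i and feasible strategies s_x from the example.
arm_nbrs = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3, 4}, 4: {3, 4}}
strategies = {1: {1}, 2: {2}, 3: {3}, 4: {4}, 5: {1, 3}, 6: {1, 4}, 7: {2, 4}}


def observed_arms(s):
    """Y_x: the union of N_i over the component arms i of strategy s."""
    return set().union(*(arm_nbrs[i] for i in s))


def strategy_relation_graph(strategies):
    """Link s_x and s_y when each is contained in the other's observed set."""
    Y = {x: observed_arms(s) for x, s in strategies.items()}
    edges = set()
    for x, y in combinations(sorted(strategies), 2):
        if strategies[y] <= Y[x] and strategies[x] <= Y[y]:
            edges.add((x, y))
    return Y, edges


Y, edges = strategy_relation_graph(strategies)
```

In the resulting graph $s_2$ and $s_5$ are linked, as in the illustration above, whereas $s_1$ and $s_4$ are not, since arm 4 is not observed when $s_1$ is played.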

The algorithm is shown in Algorithm 2, and we derive the
regret bound below directly.

Theorem 2: The expected regret of Algorithm 2 after $n$ time
slots is bounded by

$$\mathfrak{R}_n \le 15.94\sqrt{n|F|} + 0.74\,C\sqrt{n/|F|}. \quad (43)$$

In the traditional distribution-free MAB, taking each com-arm
as an unknown variable [?], the regret bound would be
$49\sqrt{n|F|}$. Our theoretical result significantly reduces the
regret and tightens the bound.





















Algorithm 2 Distribution-Free Learning policy for
combinatorial-play with side observation (DFL-CSO)

1: for each time slot $t = 0, 1, \dots, n$ do
     Select a com-arm $s_x$ by maximizing

     $$\bar{R}_{x,t} + \sqrt{\frac{\log\big(t/(K O_{x,t})\big)}{O_{x,t}}} \quad (42)$$

     to pull
2:   for $y \in N_x$ do
3:     $O_{y,t+1} \leftarrow O_{y,t} + 1$
4:     $\bar{R}_{y,t+1} \leftarrow R_{y,t}/O_{y,t} + (1 - 1/O_{y,t})\bar{R}_{y,t}$
5:   end for
6: end for

Algorithm 3 Distribution-Free Learning policy for single-play
with side reward (DFL-SSR)

1: for each time slot $t = 0, 1, \dots, n$ do
     Select an arm $i$ by maximizing

     $$\bar{B}_{i,t} + \sqrt{\frac{\log\big(t/(K O^b_{i,t})\big)}{O^b_{i,t}}} \quad (45)$$

     to pull
2:   for $k \in N_i$ do
3:     $O_{k,t+1} \leftarrow O_{k,t} + 1$
4:     if $\min_{j \in N_k} O_{j,t}$ is updated then
5:       $O^b_{k,t+1} \leftarrow O^b_{k,t} + 1$
6:       $\bar{B}_{k,t+1} \leftarrow B_{k,t}/O^b_{k,t} + (1 - 1/O^b_{k,t})\bar{B}_{k,t}$
7:     end if
8:   end for
9: end for

V. Single-play with side rewards


Though the single-play MAB with side reward has the same
observations as the single-play MAB with side observation, the
difference in the reward function makes the problem different. In
the case of SSR, the reward function is the side reward of the
selected arm $I_t$, instead of its direct reward. Here we treat the
side reward of each arm as a new unknown random variable, i.e., we
need to learn $B_{i,t}$, which is a combination of all direct
rewards in $N_i$. As the direct rewards of arms in $N_i$ are
observed asynchronously, we cannot update the observation on
$B_{i,t}$ in the way used in SSO, where observation is symmetric
between two neighboring nodes. The trick is to update the number of
observations on $B_{i,t}$ only when the direct rewards of all arms
in $N_i$ are renewed. We use $O^b_{i,t}$ to denote this quantity,
to distinguish it from $O_{i,t}$, which denotes the number of times
the direct reward is observed. Therefore, whenever an arm is played
or its neighbor is played, the number of observations on side
reward $B_{i,t}$ can be updated only when the least frequently
observed arm in $N_i$ is updated. That is,

$$O^b_{i,t+1} = \begin{cases} O^b_{i,t} + 1 & \text{if } \min_{j \in N_i} O_{j,t} \text{ is updated,} \\ O^b_{i,t} & \text{otherwise.} \end{cases} \quad (44)$$

The algorithm for single-play MAB with side reward is summarized
in Algorithm 3, where we directly use the side reward $B_{i,t}$
as the observation, and update $O^b_{i,t}$ according to (44). The
regret bound of our proposed algorithm is presented in Theorem 3.
Theorem 3: The expected regret of Algorithm 3 after $n$ time
slots is bounded by

$$\mathfrak{R}_n \le 49K\sqrt{nK}. \quad (46)$$

Proof: In this case, $B_{i,t} \in [0, K]$, which indicates that
the range of the received reward is scaled by at most $K$. We
normalize $B_{i,t}$ to $[0,1]$. Using the same techniques as in the
proof of the MOSS algorithm [?], we get the normalized regret
bound, and then the regret bound in (46) by scaling the normalized
regret bound by $K$. In Algorithm 3, the number of observations of
the side reward is no less than in the scenario without side
observation. Therefore, Algorithm 3 converges to optimality faster
than the MOSS algorithm without side observation. ■
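The asynchronous counting rule for side rewards can be sketched as follows. This is a minimal illustration under our reading of the rule: after a pull, the side-reward count of an arm $k$ in the pulled arm's neighborhood increases only if the least observed arm in $N_k$ gained an observation. The function `update_counts` and the 3-arm path graph are hypothetical.

```python
def update_counts(pulled, closed_nbrs, obs, side_obs):
    """Update direct counts O_j and side-reward counts O^b_k after one pull.

    closed_nbrs[k] is N_k (including k itself); obs and side_obs are
    dicts that are modified in place.
    """
    affected = closed_nbrs[pulled]
    # Least observed count inside each affected neighborhood, before the pull.
    before = {k: min(obs[j] for j in closed_nbrs[k]) for k in affected}
    for j in affected:
        obs[j] += 1  # every arm in N_pulled is observed once
    for k in affected:
        after = min(obs[j] for j in closed_nbrs[k])
        if after > before[k]:
            side_obs[k] += 1  # all arms in N_k have now been renewed


# Hypothetical path graph 1-2-3.
closed_nbrs = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}}
obs = {1: 0, 2: 0, 3: 0}
side_obs = {1: 0, 2: 0, 3: 0}

update_counts(1, closed_nbrs, obs, side_obs)
after_first = dict(side_obs)   # only arm 1 has a complete side reward
update_counts(3, closed_nbrs, obs, side_obs)
```

After pulling arm 1, arm 2's side reward is still incomplete because arm 3 has never been observed; it becomes complete only once arm 3 is pulled.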


VI. Combinatorial-play with side rewards 
Now we consider the combinatorial-play case with side
reward. Recall that in this scenario, the task is to select a
com-arm $s_x$ with maximum side reward, where the side reward
is the sum of the observed rewards of all arms neighboring the
arms in $s_x$. This case is more complicated than the previous
three cases, due to: 1) asymmetric observations on side reward
for neighboring nodes in one clique; 2) a possibly exponential
number of strategies caused by arbitrary constraints. Therefore,
it is complicated to analyze the regret bound by adopting the
same techniques as for combinatorial-play with side observation.
Instead of learning the side reward of strategies directly, we
learn the direct reward of the arms that compose com-arms.


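Algorithm 4 Distribution-Free Learning policy for
combinatorial-play with side reward (DFL-CSR)

1: for each time slot $t = 0, 1, \dots, n$ do
     Select a com-arm $s_x$ by maximizing

     $$\sum_{i \in Y_x} \left( \bar{X}_{i,t} + \sqrt{\frac{\max(\ln \cdots,\, 0)}{O_{i,t}}} \right) \quad (47)$$

     to pull
2:   for $k \in Y_x$ do
3:     $O_{k,t+1} \leftarrow O_{k,t} + 1$
4:     $\bar{X}_{k,t+1} = X_{k,t}/O_{k,t} + (1 - 1/O_{k,t})\bar{X}_{k,t}$
5:   end for
6: end for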


Theorem 4: The expected regret of Algorithm |4] after n time 
slots is bounded by 


in(n) < NK + + S{1 +N)N^'^ 


n3 


-|-(1 -I- )N^KnB. 

e 

where $N \le K$ is the maximum of $|Y_x|$, $x = 1, \dots, |F|$.
Proof: See Appendix.


(48) 

























(a) Expected regret (b) Accumulated regret 

Fig. 3. Comparison of regret: MOSS v.s. DFL-SSO 




(a) Sparse relation graph 


(b) Dense relation graph 


Fig. 4. Expected regret of DFL-CSO 


VII. Simulation 

In this section, we evaluate the performance of the 4 proposed
algorithms in simulations. We mainly analyze the regret
generated by each algorithm over a long horizon of n = 10000
time slots.

We first evaluate the regret generated by DFL-SSO, and compare
it with the MOSS learning policy. The experiment setting is
as follows. We randomly generate a relation graph with 100
arms, each following an i.i.d. random process over time with
mean in [0,1]. We then plot the accumulated regret and
expected regret over time, as shown in Fig. 3(a). Though the
expected regret over time of MOSS converges to a value
around 0, which coincides with its theoretical bound in Fig. 3(a),
its accumulated regret grows dramatically. It is obvious that the
proposed algorithm with side information performs much better
than MOSS, e.g., the accumulated regret and expected regret of
our proposed algorithm (DFL-SSO) both converge to 0.
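A setup of this kind can be generated in a few lines. The sketch below is an assumption-laden illustration, not the authors' simulation code: it draws each edge independently with probability `p` and each mean uniformly from [0,1]; the function name `random_relation_graph`, the seed, and the choice of uniform means are ours.

```python
import random


def random_relation_graph(K, p, seed=0):
    """Return (adjacency sets, arm means) for K arms.

    Each pair of arms is linked independently with probability p,
    and each mean is drawn uniformly from [0, 1].
    """
    rng = random.Random(seed)
    adj = {i: set() for i in range(K)}
    for i in range(K):
        for j in range(i + 1, K):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)  # the relation graph is undirected
    means = [rng.random() for _ in range(K)]
    return adj, means


adj, means = random_relation_graph(K=100, p=0.3)
```

Raising `p` yields a denser relation graph, so each pull reveals more arms.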

For the other three algorithms, since we are the first to study
these three variants of the MAB problem, there are no candidate
algorithms to compare against. We therefore show the trend of
the expected regret over time for each case. In the evaluation
of Algorithm 2, we note that the regret bound contains two
terms: the number of com-arms and the number of cliques. The
upper bound becomes huge if the number of com-arms is large,
whereas a small clique number can significantly reduce the
bound. To investigate this impact experimentally, we test the
regret under both a sparse and a dense relation graph. In
Fig. 4(a), where the arms are uniformly and randomly connected
with a low probability of 0.3, the expected regret slowly
increases beyond 0. In Fig. 4(b), where the arms are uniformly
and randomly connected with a higher probability of 0.6, the
expected regret gradually approaches 0. This implies that side
observation indeed helps to reduce regret when more can be
observed, even in the case where previous literature shows that
learning each individual com-arm of a huge feasible strategy
set introduces exponential regret [?]. The simulation results
for Algorithms 3 and 4 are shown in Figs. 5 and 6, where the
expected regret in both figures converges to 0 rapidly.

Fig. 5. Expected regret of DFL-SSR

Fig. 6. Expected regret of DFL-CSR (horizontal axis: time slot)
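The sparse and dense settings can be reproduced in spirit with a short sketch. It assumes each pair of arms is connected independently with a fixed probability, and it uses a simple greedy clique cover to count cliques. This is a heuristic that only upper-bounds the clique cover number, and it is not necessarily the partition used in our analysis. The names `random_relation_graph` and `greedy_clique_cover` are hypothetical.

```python
import random


def random_relation_graph(num_arms, edge_prob, seed=0):
    """Connect each pair of arms independently with probability edge_prob."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(num_arms)}
    for i in range(num_arms):
        for j in range(i + 1, num_arms):
            if rng.random() < edge_prob:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def greedy_clique_cover(adj):
    """Greedily partition the arms into cliques.

    Each arm joins the first existing clique whose members are all its
    neighbors; otherwise it starts a new clique.
    """
    cliques = []
    for v in sorted(adj):
        for clique in cliques:
            if all(u in adj[v] for u in clique):
                clique.append(v)
                break
        else:
            cliques.append([v])
    return cliques


if __name__ == "__main__":
    for p in (0.3, 0.6):
        graph = random_relation_graph(100, p, seed=3)
        print(p, "cliques found:", len(greedy_clique_cover(graph)))
```

A denser graph lets arms join existing cliques more often, so the cover typically needs fewer cliques, which is the quantity the regret bound of Algorithm 2 depends on.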






























VIII. Related works 

The classical multi-armed bandit problem does not assume
the existence of side bonus. More recently, [?] and [?]
considered the networked bandit problem in the presence of
side observations. They study the single-play case and propose
several policies whose regret bounds depend on $\Delta_{\min}$,
so that an arbitrarily small $\Delta_{\min}$ invalidates the
zero-regret result. In this work, we present the first
distribution-free policy for the single-play case with side
observation.

For the variant with combinatorial play without side bonus,
Anantharam et al. [?] first considered the problem in which
exactly N arms are selected simultaneously without constraints
among arms. Gai et al. recently extended this version to a more
general problem with arbitrary constraints [?]. The model
is also relaxed to a linear combination of no more than N
arms. However, the results presented in [?] are
distribution-dependent. In contrast, we are the first to study
the combinatorial-play case in the presence of side bonus. In
particular, for the combinatorial-play case with side
observation, we develop a distribution-free zero-regret
learning policy. We theoretically show that this scheme
converges faster than the existing method. For the
combinatorial-play case with side reward, we propose the first
distribution-free learning policy that has zero regret.

IX. Conclusion 

In this paper, we investigate networked combinatorial bandit
problems under four cases. This is motivated by the existence
of potential correlation or influence among neighboring arms.
We present and analyze a series of zero-regret policies for
each case. In the future, we are interested in investigating
heuristics to improve the received regret in practice. For
example, at each time slot, instead of playing the selected
arm/strategy $I_t$ with maximum index value (Equations (?), (42)),
we will play the arm/strategy that has the maximum empirical
average observation among the neighbors of $I_t$. Therefore, we
ensure that the received reward is better than the one with
maximum index value.
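The heuristic described above can be sketched as follows for the single-play case. The sketch assumes that the arm $I_t$ selected by the index is itself kept as a candidate alongside its neighbors, and that empirical averages are available for all of them through side observation. The function name `heuristic_choice` and its arguments are hypothetical.

```python
def heuristic_choice(index_arm, neighbors, empirical_mean):
    """Pick the arm to play after the index policy has selected index_arm.

    Candidates are index_arm and its neighbors in the relation graph; the
    one with the largest empirical average observation is played. Ties are
    broken in favor of the smallest arm id.
    """
    candidates = sorted(set(neighbors.get(index_arm, ())) | {index_arm})
    return max(candidates, key=lambda arm: empirical_mean[arm])


if __name__ == "__main__":
    relation = {0: [1, 2], 1: [0], 2: [0]}
    averages = {0: 0.4, 1: 0.7, 2: 0.6}
    print(heuristic_choice(0, relation, averages))
```

Whether such a rule preserves the zero-regret guarantee is left open here; the sketch only fixes the selection step.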

X. Appendix 
A. Proof of Theorem 0 

To prove the theorem, we will use Chernoff-Hoeffding 
bound and the maximal inequality by Hoeffding [?]. 

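Lemma 1: (Chernoff-Hoeffding Bound [?]) $\xi_1, \ldots, \xi_n$ are
random variables within range [0,1], and
$E[\xi_t \mid \xi_1, \ldots, \xi_{t-1}] = \mu$, $\forall 1 \le t \le n$.
Let $S_n = \sum_{t=1}^{n} \xi_t$. Then for all $a \ge 0$,

$$P(S_n \ge n\mu + a) \le \exp(-2a^2/n), \qquad P(S_n \le n\mu - a) \le \exp(-2a^2/n). \tag{49}$$

Lemma 2: (Maximal inequality) [?] $\xi_1, \ldots, \xi_n$ are i.i.d.
random variables with expectation $\mu$. Then for any $y > 0$ and $n > 0$,

$$P\Big(\exists \tau \in \{1, \ldots, n\},\ \sum_{t=1}^{\tau} (\mu - \xi_t) > y\Big) \le \exp\Big(-\frac{2y^2}{n}\Big). \tag{50}$$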
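As a sanity check of the Chernoff-Hoeffding bound in Lemma 1, the following sketch compares the bound with the exact upper tail of a sum of i.i.d. Bernoulli variables, which is one special case covered by the lemma. The helper names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_upper_tail(n, p, threshold):
    """Exact P(S_n >= threshold) for S_n ~ Binomial(n, p)."""
    start = max(0, math.ceil(threshold))
    return sum(
        math.comb(n, s) * p**s * (1 - p) ** (n - s)
        for s in range(start, n + 1)
    )


def hoeffding_bound(n, a):
    """Right-hand side exp(-2 a^2 / n) of the bound in Lemma 1."""
    return math.exp(-2.0 * a * a / n)


if __name__ == "__main__":
    n, mu = 100, 0.5
    for a in (5, 10, 20):
        exact = binomial_upper_tail(n, mu, n * mu + a)
        print(a, exact, hoeffding_bound(n, a))
```

The exact tail is always below the bound, and the gap illustrates how conservative the bound can be for a specific distribution.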

Each com-arm $s_x$ and its neighboring arm set actually
compose a new com-arm, which is denoted by $Y_x$, with
$s_x \subseteq Y_x$. Each new com-arm $Y_x$ corresponds to an unknown
bonus $CB_{x,t}$ with mean $\sigma_x$. Recall that we have assumed
$\sigma_1 \ge \cdots \ge \sigma_{|F|}$. As com-arm $Y_1$ is the optimal com-arm,
we have $\Delta_x = \sigma_1 - \sigma_x$, and let $Z_x = \sigma_1 - \frac{\Delta_x}{2}$. We further
define $W_1 = \min_{1 \le t \le n} W_{1,t}$, and let $z$ be the first time
slot attaining this minimum, i.e., $z = \arg\min_{1 \le t \le n} W_{1,t}$.

1. Rewrite regret in terms of arms
Separating the strategies into two sets by $\Delta_{x_0}$ of some
com-arm $s_{x_0}$ (we will define $x_0$ later in the proof), we can
bound the regret as

$$\sum_{x=1}^{x_0} \Delta_x E[T_{x,n}] + \sum_{x=x_0+1}^{|F|} \Delta_x E[T_{x,n}] \le \Delta_{x_0} n + \sum_{x=x_0+1}^{|F|} \Delta_x E[T_{x,n}]. \tag{51}$$

We then analyze the second term of (51). As there may be an
exponential number of strategies, counting $T_{x,n}$ of each
com-arm by the classic upper-confidence-bound analysis yields
regret growing linearly with the number of strategies. Noting
that each com-arm consists of at most N arms, we can rewrite
the regret in terms of arms instead of strategies. We then
introduce a set of counters $\{\tilde{T}_{k,n} \mid k = 1, \ldots, K\}$. At each
time slot, either 1) a com-arm with $\Delta_x \le \Delta_{x_0}$ or 2) a
com-arm with $\Delta_x > \Delta_{x_0}$ is played. In the first case, no
$\tilde{T}_{k,n}$ will get updated. In the second case, we increase
$\tilde{T}_{k,n}$ by 1 for one arm $k = \arg\min_{j \in Y_x} \{O_{j,t}\}$. Thus
whenever a com-arm with $\Delta_x > \Delta_{x_0}$ is chosen, exactly one
element in $\{\tilde{T}_{k,n}\}$ increases by 1. This implies that the total
number of times that strategies with $\Delta_x > \Delta_{x_0}$ have been
played is equal to the sum of all counters in $\{\tilde{T}_{k,n}\}$, i.e.,
$\sum_{x=x_0+1}^{|F|} T_{x,n} = \sum_{k=1}^{K} \tilde{T}_{k,n}$. Thus, we can
rewrite the second term of (51) as

$$\sum_{x=x_0+1}^{|F|} \Delta_x E[T_{x,n}] \le \Delta_{\max} \sum_{x=x_0+1}^{|F|} E[T_{x,n}] \le \Delta_{\max} \sum_{k=1}^{K} E[\tilde{T}_{k,n}], \tag{52}$$

where $\Delta_{\max} = \max_x \Delta_x$.
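The counter argument above can be made concrete with a small sketch. It assumes that playing a com-arm $Y_x$ yields one observation of every arm in $Y_x$, and that ties in the observation counts are broken by the smallest arm id; the function names and the toy play sequence are hypothetical.

```python
def play_and_update(counters, observations, com_arm, gap, gap_threshold):
    """Record one play of com_arm and update the per-arm counters.

    If the gap of the played com-arm exceeds gap_threshold, exactly one
    counter is charged: that of the arm in com_arm with the fewest
    observations so far. Every arm in com_arm is then observed once.
    """
    charged = None
    if gap > gap_threshold:
        charged = min(com_arm, key=lambda j: (observations[j], j))
        counters[charged] += 1
    for j in com_arm:
        observations[j] += 1
    return charged


def demo():
    """Replay a toy sequence of (com-arm, gap) pairs over four arms."""
    counters = {k: 0 for k in range(4)}
    observations = {k: 0 for k in range(4)}
    plays = [((0, 1, 2), 0.5), ((1, 3), 0.05), ((2, 3), 0.4), ((0, 1, 2), 0.5)]
    large_gap_plays = 0
    for com_arm, gap in plays:
        if play_and_update(counters, observations, com_arm, gap, 0.1) is not None:
            large_gap_plays += 1
    return counters, large_gap_plays


if __name__ == "__main__":
    print(demo())
```

In the toy run, the counters sum to the number of plays whose gap exceeds the threshold, which is the identity used to pass from strategies to arms.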

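Let $I_{k,t}$ be the indicator function that equals 1 if $\tilde{T}_{k,n}$ is
updated at time slot $t$. Define the indicator function $\mathbb{1}\{y\} = 1$
if the event $y$ happens and 0 otherwise. When $I_{k,t} = 1$, a
com-arm $Y_x$ with $x > x_0$ has been played for which
$O_{k,t} = \min\{O_{j,t} : \forall j \in Y_x\}$. Then

\begin{align}
\tilde{T}_{k,n} &= \sum_{t=1}^{n} \mathbb{1}\{I_{k,t} = 1\} \tag{53} \\
&\le \sum_{t=1}^{n} \mathbb{1}\{W_{1,t} \le W_{x,t}\} \tag{54} \\
&\le \sum_{t=1}^{n} \mathbb{1}\{W_1 \le W_{x,t}\} \tag{55} \\
&\le \sum_{t=1}^{n} \mathbb{1}\{W_1 \le W_{x,t},\ W_1 \ge Z_x\} \tag{56} \\
&\quad + \sum_{t=1}^{n} \mathbb{1}\{W_1 \le W_{x,t},\ W_1 < Z_x\} \tag{57} \\
&= \tilde{T}^1_{k,n} + \tilde{T}^2_{k,n}. \tag{58}
\end{align}

We use $\tilde{T}^1_{k,n}$ and $\tilde{T}^2_{k,n}$ to denote the sums in Equations (56)
and (57), respectively. Next we show that both terms are
bounded.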


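2. Bounding $\tilde{T}^1_{k,n}$

Here we note that the events $\{W_1 \ge Z_x\}$ and $\{W_{x,t} \ge W_1\}$
imply the event $\{W_{x,t} \ge Z_x\}$. Let $\ln_+(y) = \max(\ln(y), 0)$. For
any positive integer $l_0$, taking expectations, we then have

\begin{align}
E[\tilde{T}^1_{k,n}] &\le E\Big[\sum_{t=1}^{n} \mathbb{1}\{W_{x,t} \ge Z_x\}\Big] \tag{59} \\
&\le l_0 + E\Big[\sum_{t=l_0}^{n} \mathbb{1}\{W_{x,t} \ge Z_x,\ \tilde{T}^1_{k,t} > l_0\}\Big] \tag{60} \\
&= l_0 + \sum_{t=l_0}^{n} P\{W_{x,t} \ge Z_x,\ \tilde{T}^1_{k,t} > l_0\} \tag{61} \\
&= l_0 + \sum_{t=l_0}^{n} P\Big\{\sum_{j \in Y_x}\Big(\bar{X}_{j,t} + \sqrt{\frac{\ln_+(t^{2/3}/(K O_{j,t}))}{O_{j,t}}}\Big) \ge \sum_{j \in Y_x} \mu_j + \frac{\Delta_x}{2},\ \tilde{T}^1_{k,t} > l_0\Big\}. \tag{62}
\end{align}

The event $\Big\{\sum_{j \in Y_x}\Big(\bar{X}_{j,t} + \sqrt{\frac{\ln_+(t^{2/3}/(K O_{j,t}))}{O_{j,t}}}\Big) \ge \sum_{j \in Y_x} \mu_j + \frac{\Delta_x}{2}\Big\}$
indicates that the following must be true:

$$\exists j \in Y_x,\quad \bar{X}_{j,t} + \sqrt{\frac{\ln_+(t^{2/3}/(K O_{j,t}))}{O_{j,t}}} \ge \mu_j + \frac{\Delta_x}{2N}. \tag{63}$$

Using the union bound, one directly obtains:

\begin{align}
E[\tilde{T}^1_{k,n}] &\le l_0 + \sum_{t=l_0}^{n} \sum_{j \in Y_x} P\Big\{\bar{X}_{j,t} + \sqrt{\frac{\ln_+(t^{2/3}/(K O_{j,t}))}{O_{j,t}}} \ge \mu_j + \frac{\Delta_x}{2N}\Big\} \notag \\
&\le l_0 + \sum_{t=l_0}^{n} \sum_{j \in Y_x} P\Big\{\bar{X}_{j,t} - \mu_j \ge \frac{\Delta_x}{2N} - \sqrt{\frac{\ln_+(t^{2/3}/(K O_{j,t}))}{O_{j,t}}}\Big\}. \tag{64}
\end{align}

Now we let $l_0 = \lceil 16N^2 \ln(n^{2/3}\Delta_x^2/K)/\Delta_x^2 \rceil$, with $\lceil y \rceil$
the smallest integer larger than $y$. We further set
$\delta_0 = e^{1/2}\sqrt{K/n^{1/3}}$ and set $x_0$ such that $\Delta_{x_0} \le \delta_0 < \Delta_{x_0+1}$.
As $O_{j,t} \ge l_0$,

$$\ln_+\Big(\frac{t^{2/3}}{K O_{j,t}}\Big) \le \ln_+\Big(\frac{n^{2/3}}{K l_0}\Big) \le \ln_+\Big(\frac{n^{2/3}\Delta_x^2}{K} \times \frac{1}{16N^2}\Big) \le \frac{l_0 \Delta_x^2}{16N^2} \le \frac{O_{j,t}\Delta_x^2}{16N^2}.$$

Hence we have

$$\frac{\Delta_x}{2N} - \sqrt{\frac{\ln_+(t^{2/3}/(K O_{j,t}))}{O_{j,t}}} \ge \frac{\Delta_x}{2N} - \frac{\Delta_x}{4N} = c\Delta_x, \tag{65}$$

with $c = \frac{1}{2N} - \frac{1}{4N}$.

Therefore, using Hoeffding's inequality and Equation (65),
and then plugging in the value of $l_0$, we get

\begin{align}
E[\tilde{T}^1_{k,n}] &\le l_0 + \sum_{t=l_0}^{n} \sum_{j \in Y_x} P\{\bar{X}_{j,t} - \mu_j \ge c\Delta_x\} \tag{66} \\
&\le l_0 + \sum_{t=l_0}^{n} \sum_{j \in Y_x} \exp(-2 O_{j,t} (c\Delta_x)^2) \tag{67} \\
&\le l_0 + K \cdot n \cdot \exp(-2 l_0 (c\Delta_x)^2) \notag \\
&\le 1 + 16N^2 \frac{\ln(n^{2/3}\Delta_x^2/K)}{\Delta_x^2} + K \cdot n \cdot \exp(-2\ln(n^{1/3}e)). \tag{68}
\end{align}

As $\delta_0 = e^{1/2}\sqrt{K/n^{1/3}}$ and $\Delta_x > \delta_0$, the second term in
(68) is bounded by

$$\frac{16N^2 (1 + \ln n^{1/3})}{Ke}\, n^{1/3}.$$

The last term of (68) is bounded by

$$K \cdot n \cdot \exp(-2\ln(n^{1/3}e)) \le \frac{K}{e^2} \cdot n^{1/3}.$$

Finally we get

$$E[\tilde{T}^1_{k,n}] \le 1 + \Big(\frac{16N^2 (1 + \ln n^{1/3})}{Ke} + \frac{K}{e^2}\Big) n^{1/3}. \tag{69}$$

3. Bounding $\tilde{T}^2_{k,n}$

$$E[\tilde{T}^2_{k,n}] = E\Big[\sum_{t=1}^{n} \mathbb{1}\{W_1 \le W_{x,t},\ W_1 < Z_x\}\Big] \le \sum_{t=1}^{n} P\{W_1 < Z_x\} \le nP\{W_1 < Z_x\}. \tag{70}$$

Remember that at time slot $z$, we have $W_1 = \min_t W_{1,t}$. For
the probability $P\{W_1 < Z_x\}$ of fixed $x$, we have

\begin{align}
& P\Big\{W_1 < \sigma_1 - \frac{\Delta_x}{2}\Big\} \tag{71} \\
&= P\Big\{\sum_{j \in Y_1} w_{j,z} < \sigma_1 - \frac{\Delta_x}{2}\Big\} \tag{72} \\
&\le \sum_{j \in Y_1} P\Big\{w_{j,z} < \mu_j - \frac{\Delta_x}{2N}\Big\}. \tag{73}
\end{align}

We define the function $f(u) = e\ln\big(\sqrt{n^{1/3}/K}\,u\big)/u^3$ for
$u \in [\delta_0, N]$. Then we have

\begin{align}
P\Big\{w_{j,z} < \mu_j - \frac{\Delta_x}{2N}\Big\}
&\le P\Big\{\exists 1 \le l \le n : \sum_{\tau=1}^{l} X_{j,\tau} + \sqrt{l \ln_+\Big(\frac{n^{2/3}}{Kl}\Big)} < l\mu_j - \frac{l\Delta_x}{2N}\Big\} \notag \\
&\le P\Big\{\exists 1 \le l \le n : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \sqrt{l \ln_+\Big(\frac{n^{2/3}}{Kl}\Big)} + \frac{l\Delta_x}{2N}\Big\} \notag \\
&\le P\Big\{\exists 1 \le l \le f(\Delta_x) : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \sqrt{l \ln_+\Big(\frac{n^{2/3}}{Kl}\Big)}\Big\} \notag \\
&\quad + P\Big\{\exists f(\Delta_x) < l \le n : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \frac{l\Delta_x}{2N}\Big\}. \tag{74}
\end{align}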




































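For the first term we use a peeling argument with a geometric
grid of the form $\frac{f(\Delta_x)}{2^{g+1}} \le l \le \frac{f(\Delta_x)}{2^g}$:

\begin{align}
& P\Big\{\exists 1 \le l \le f(\Delta_x) : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \sqrt{l \ln_+\Big(\frac{n^{2/3}}{Kl}\Big)}\Big\} \notag \\
&\le \sum_{g=0}^{\infty} P\Big\{\exists \frac{f(\Delta_x)}{2^{g+1}} \le l \le \frac{f(\Delta_x)}{2^g} : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \sqrt{\frac{f(\Delta_x)}{2^{g+1}} \ln_+\Big(\frac{n^{2/3} 2^g}{K f(\Delta_x)}\Big)}\Big\} \notag \\
&\le \sum_{g=0}^{\infty} \exp\Big(-2\, \frac{f(\Delta_x)\, 2^{-(g+1)} \ln_+\big(\frac{n^{2/3} 2^g}{K f(\Delta_x)}\big)}{f(\Delta_x)\, 2^{-g}}\Big) \notag \\
&\le \sum_{g=0}^{\infty} \frac{K f(\Delta_x)}{n^{2/3}} \cdot \frac{1}{2^g} \notag \\
&\le \frac{2K f(\Delta_x)}{n^{2/3}}, \tag{75}
\end{align}

where in the second inequality we use Lemma 2.

By the design of the function $f(u)$, $f(u)$ attains its
maximum of $\frac{n^{1/2}}{3K^{3/2}}$ when $u = e^{1/3}\sqrt{K/n^{1/3}}$. For
$\Delta_x > e^{1/2}\sqrt{K/n^{1/3}}$, we have

$$\frac{2K f(\Delta_x)}{n^{2/3}} \le \frac{2}{3\sqrt{K}}\, n^{-1/6}. \tag{76}$$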

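For the second term we also use a peeling argument, but with
a geometric grid of the form $2^g f(\Delta_x) \le l \le 2^{g+1} f(\Delta_x)$:

\begin{align}
& P\Big\{\exists f(\Delta_x) < l \le n : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \frac{l\Delta_x}{2N}\Big\} \notag \\
&\le \sum_{g=0}^{\infty} P\Big\{\exists 2^g f(\Delta_x) \le l \le 2^{g+1} f(\Delta_x) : \sum_{\tau=1}^{l} (\mu_j - X_{j,\tau}) > \frac{2^{g-1} f(\Delta_x) \Delta_x}{N}\Big\} \notag \\
&\le \sum_{g=0}^{\infty} \exp\Big(-\frac{2^g f(\Delta_x) \Delta_x^2}{4N^2}\Big) \notag \\
&\le \sum_{g=0}^{\infty} \exp\Big(-\frac{(g+1) f(\Delta_x) \Delta_x^2}{4N^2}\Big) \notag \\
&= \frac{1}{\exp\big(f(\Delta_x)\Delta_x^2/4N^2\big) - 1}. \tag{77}
\end{align}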
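The two series appearing in the peeling arguments can be checked numerically. The sketch below is purely illustrative: `x` stands for the ratio $n^{2/3}/(K f(\Delta_x))$ and `c` for $f(\Delta_x)\Delta_x^2/(4N^2)$, both treated as free positive parameters, and the infinite sums are truncated after a fixed number of terms. The function names are hypothetical.

```python
import math


def ln_plus(y):
    """ln_+(y) = max(ln(y), 0)."""
    return max(math.log(y), 0.0)


def first_peeling_sum(x, terms=60):
    """Truncated sum over g of exp(-ln_+(x * 2^g)); bounded by 2 / x."""
    return sum(math.exp(-ln_plus(x * 2.0**g)) for g in range(terms))


def second_peeling_sums(c, terms=60):
    """Return (sum of exp(-2^g c), sum of exp(-(g+1) c), 1 / (e^c - 1))."""
    doubled = sum(math.exp(-(2.0**g) * c) for g in range(terms))
    linear = sum(math.exp(-(g + 1) * c) for g in range(terms))
    closed_form = 1.0 / (math.exp(c) - 1.0)
    return doubled, linear, closed_form


if __name__ == "__main__":
    for x in (0.5, 3.0, 100.0):
        print(x, first_peeling_sum(x), 2.0 / x)
    for c in (0.1, 1.0, 5.0):
        print(c, second_peeling_sums(c))
```

The first check reflects that each term is at most $1/(x 2^g)$, so the series is at most $2/x$. The second reflects that $2^g \ge g+1$, so the doubled-exponent series is dominated by a geometric series with closed form $1/(e^c - 1)$.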
We note that f{u)v? has a minimum of when u = 


(77) 


aiQ. Thus for dTT] ). we further have, 

1 1 

< 


< 


A'/KN'^n 6 


Combining (|7^ and ( fTOl i. we then have 


(78) 


S'/K e e 

4. Results without dependency on $\Delta_{\min}$


Summing $\tilde{T}^1_{k,n}$ and $\tilde{T}^2_{k,n}$, we have

rp ^ rjil I /t=!2 

-Lx,n ^ ^ 

= 1 + —r^{l + + 1 + -)AnS 

Ae 15 e 

and using $\Delta_x \le N$ for all $x$ and $\Delta_x \le \delta_0$ for $x \le x_0$, we have
lH(n) < VKen'^ + NK 


' 16^2 2 


4+7fw+., , 

+ (1 H-)AnE 


< NK + + 8(1 + N)N^'^ ni 

) 

-)N‘^KnK 


AyfKN^ , . 

+ (1 +- 



































